motivation:

Clinical genetic testing returns results that are categorized by testing labs into one of 5 categories: Benign Likely benign Variant of Uncertain Significance (VUS) Likely Pathogenic Pathogenic These classifications can change over time. We do not know how frequently variants change classification. We do not know if the rate at which variants are reclassified is different amongst different ancestry groups. We do not know if the rate at which variants are reclassified is different across different disease categories. The goal of this project will be to discover these differences across all of ClinVar.”

load packages

#read in data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(purrr)
library(fs)
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
library(arrow)
## Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
## 
## Attaching package: 'arrow'
## 
## The following object is masked from 'package:lubridate':
## 
##     duration
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
library(ggpie)
library(ggplot2)
library(reactable)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library("ggrepel")
library(corrplot)
## corrplot 0.92 loaded
library(ggeffects)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

load original data (retrieved on 05-23-2020)

number of variants

merged_csv %>% distinct(variation_id) %>% nrow()
## [1] 2278097

number of variants per classification category

merged_csv %>% distinct(submitted_classification, variation_id) %>% group_by(submitted_classification) %>% summarise(count=n())

number of variants per method type

merged_csv %>% distinct(method_type, variation_id) %>% group_by(method_type) %>% summarise(count=n()) %>% arrange(desc(count))

number of variants per method type

filtering steps

  1. merge all the CLINVAR CSV files (variant submission).
  2. filter the data by specific labels of classification, which are benign, pathogenic, likely pathogenic, pathologic, uncertain significance, likely benign, and unknown significance.
  3. filter by “method_type”.

we added CLINVAR disease information and gnomad ancestry frequencies

  1. merge gnomAD ancestry info to CLINVAR using annotation technique from vcfanno
  2. add trait mapping information. Take note of rows with NAs
  • make summary plots for CLINVAR and gnomAD annotated data for just allele frequency alone.
  1. join the gnomAD annotated CLINVAR data with the filtered CLINVAR data, which includes trait-mapping info.
  2. We also selected “clinical testing” as the method used for testing.

The GNOMAD data were filtered by selecting only the allele frequency (AF) for the major ancestry groups, where the ancestry group with the max AF was selected for each variant. The GNOMAD data includes gene information, which we did not need to add. For the CLINVAR data, we calculated the first date of occurence for a submission and the last, i.e the first time a variant occured, and then the last occurence.

The structure of the data (CLINVAR merged with GNOMAD) should haveone variant per line per submission (date of occurence), but the trait of disease /phenotype associated can make it multi-level. To fix this, do: 8. collapse the multiple traits that are assigned to each variant per submission date, into a parent trait per line. We used the MESH term hierarchical structure as a selection guide.

We collapsed the traits into one line per variant and per submission. Select a second term from the root of the MESH terms; and then decide on what patterns of variant reclassifications are categorized as “upgrade” or “downgrade”, and integrate this information into the data. We wrote a function in R to take the processed CLINVAR data, and annotate it with the last or root MESH terms for traits.

We annotated the processed CLINVAR data with Dr. Drivas’ manual classification of a submitted variation such as : “Conflicting/Complex”, “Downgrade Path to Benign”, “Downgrade Path to VUS”, “Downgrade VUS to Benign”, “No Change”, “Upgrade Benign to Path”, “Upgrade Benign to VUS”, and “Upgrade VUS to Path”.

We added Dr. Drivas’s manually curated classification with the filtered and pre-processed CLINVAR data

  1. to add gene and disease info, we extracted the full tables from MeSH db, MeSH.AOR.db and MeSH.PCR.db that is housed in bioconductor, using R. Joined each of these tables together based on the keys (either MESHID or OFFSPRING). We then mapped the MESHID of the final joined table to a MedGen ID, which is present in the current processed CLINVAR data. Next, We joined the mapped table to the CLINVAR table, which now has the last MESH terms of each variant submission, separated in one line / row.

final data (processed)

load filtered and processed data

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing = readRDS("/project/drivas_shared/projects/CLINVAR/rds_obj/clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing.rds")

number of variants

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% distinct(variation_id) %>% nrow()
## [1] 1054672

search this table by gene name, variant id, etc

count of variants by reclassification and between ancestry

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
  filter(!is.na(class_merged)) %>%
#  mutate(bin_change = if_else(Classification == "No Change", "No Change", "Change")) %>%
 group_by(class_merged, ethnic_group_AFmax) %>% 
  summarise(count=n()) %>% 
  ungroup() %>% 
  group_by(ethnic_group_AFmax) %>% 
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  ggplot(aes(x = ethnic_group_AFmax, y = percent_count, fill = class_merged)) + 
  geom_bar(stat="identity") + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 90), legend.title=element_blank()) + 
  ylab(label = "percent count") + 
  xlab(label = "classification") +
  labs(color=NULL)
## `summarise()` has grouped output by 'class_merged'. You can override using the
## `.groups` argument.

drop “no change” category

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
  filter(!is.na(class_merged), class_merged != "No Change") %>%
#  mutate(bin_change = if_else(Classification == "No Change", "No Change", "Change")) %>%
 group_by(class_merged, ethnic_group_AFmax) %>% 
  summarise(count=n()) %>% 
  ungroup() %>% 
  group_by(ethnic_group_AFmax) %>%
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  ggplot(aes(x = ethnic_group_AFmax, y = percent_count, fill = class_merged)) + 
  geom_bar(stat="identity") + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 90), legend.title=element_blank()) + 
  ylab(label = "percent count") + 
  xlab(label = "classification") +
  labs(color=NULL)
## `summarise()` has grouped output by 'class_merged'. You can override using the
## `.groups` argument.

count of variants per ancestry group by reclassification (change or no change)

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  group_by(ethnic_group_AFmax,class_merged) %>% 
  filter(!is.na(class_merged)) %>% 
  mutate(bin_change = if_else(class_merged == "No Change", "0", "1")) %>% 
  summarise(count=n()) %>% 
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  ggplot(aes(x = ethnic_group_AFmax, y = percent_count, fill = class_merged)) + 
  geom_bar(stat="identity", color="black", position=position_dodge()) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  ylab(label = "percent count") + 
  xlab(label = "ancestry")
## `summarise()` has grouped output by 'ethnic_group_AFmax'. You can override
## using the `.groups` argument.

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
  filter(!is.na(class_merged)) %>%
  mutate(bin_change = if_else(class_merged == "No Change", "No Change", "Change")) %>%
 group_by(ethnic_group_AFmax,bin_change) %>% 
  summarise(count=n()) %>%
  ungroup() %>% 
  group_by(ethnic_group_AFmax) %>%
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  ggplot(aes(x = ethnic_group_AFmax, y = percent_count, fill = bin_change)) + 
  geom_bar(stat="identity") + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 90), legend.title=element_blank()) + 
  ylab(label = "percent count") + 
  xlab(label = "ancestry") +
  labs(color=NULL)
## `summarise()` has grouped output by 'ethnic_group_AFmax'. You can override
## using the `.groups` argument.

recode reclassification

we recode the classification, where “No Change” = “0”,and and change = “1”

glm_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  distinct(variation_id, .keep_all = TRUE) %>% 
  filter(!is.na(class_merged)) %>% 
  mutate(bin_change = if_else(class_merged == "No Change", "0", "1")) %>% 
  spread(key = ethnic_group_AFmax, value = ethnic_group_AFmax_value) %>% 
  select(bin_change, AF_afr, AF_amr, AF_asj, AF_eas, AF_fin, AF_nfe, AF_oth, AF_sas) %>% 
  mutate(bin_change = as.numeric(bin_change)) 

we recode the reclassification, where “No change” is removed

glm_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  distinct(variation_id, .keep_all = TRUE) %>% 
  filter(!is.na(class_merged), class_merged != "No Change") %>% 
#  mutate(bin_change = if_else(Classification == "No Change", "0", "1")) %>% 
  spread(key = ethnic_group_AFmax, value = ethnic_group_AFmax_value) %>% 
  select(class_merged, AF_afr, AF_amr, AF_asj, AF_eas, AF_fin, AF_nfe, AF_oth, AF_sas)
#  mutate(bin_change = as.numeric(bin_change)) 

create contigency table

density plot: max AF by ethnic group

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  ggplot(aes(x=ethnic_group_AFmax_value, color=ethnic_group_AFmax)) + 
  geom_density() + 
  scale_y_continuous(trans="log10", name="density (log10)") + 
  labs(x = "ancestry group AF max") + 
  theme_classic() + guides(color = guide_legend(title = "ancestry groups"))
## Warning: Transformation introduced infinite values in continuous y-axis

density plot:

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  ggplot(aes(x=AF_value, color=AF_ethnic_group)) + 
  geom_density() + 
  scale_y_continuous(trans="log10", name="density (log10)") + 
  labs(x = "max AF") + 
  theme_classic() + 
  theme(legend.position = "none")

count of variants per last MESH term

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  distinct(last_mesh_term, variation_id) %>% 
  group_by(last_mesh_term) %>% 
  summarise(count=n()) %>% 
  arrange(desc(count))

count of variants per reclassication pattern

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  distinct(pattern, variation_id) %>% 
  group_by(pattern) %>% 
  summarise(count=n()) %>% 
  arrange(desc(count))

count of variants per disease association

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  distinct(medgen_name, variation_id) %>% 
  group_by(medgen_name) %>% 
  summarise(count=n()) %>% 
  arrange(desc(count))

count of variants per trait type

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  distinct(trait_type, variation_id) %>% 
  group_by(trait_type) %>% 
  summarise(count=n()) %>% 
  arrange(desc(count))

count of variants per disease per reclassification

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  distinct(medgen_name, class_merged, variation_id) %>% 
  group_by(medgen_name, class_merged) %>% 
  summarise(count=n()) %>% arrange(desc(count))
## `summarise()` has grouped output by 'medgen_name'. You can override using the
## `.groups` argument.

distribution of associated disease by ancestry group

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  distinct(ethnic_group_AFmax, last_mesh_term) %>% 
  group_by(ethnic_group_AFmax, last_mesh_term) %>% 
  summarise(count=n()) %>% 
  arrange(desc(count))
## `summarise()` has grouped output by 'ethnic_group_AFmax'. You can override
## using the `.groups` argument.

plot piechart

pie_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  group_by(submitted_classification) %>% 
  summarise(count=n()) %>% 
  ungroup() %>% 
  mutate(perc = count/sum(count)) %>% 
  arrange(perc) %>% 
  mutate(labels = scales::percent(perc))


# Create a basic bar
pie = ggplot(pie_dat, aes(x="", y=perc, fill=submitted_classification)) + geom_bar(stat="identity", width=1)
 
# Convert to pie (polar coordinates) and add labels
pie = pie + coord_polar("y", start=0) + geom_text(aes(label = paste0(round(perc*100), "%")), position = position_stack(vjust = 0.5)) + theme_void()
 
# Add color scale (hex colors)
pie = pie + scale_fill_manual(values=c("#55DDE0", "#33658A", "#2F4858", "#F6AE2D", "#F26419", "#999999")) 
 
# Remove labels and add title
pie = pie + labs(x = NULL, y = NULL, fill = NULL, title = "submitted variant classification")
 
# Tidy up the theme
pie = pie + theme_classic() + theme(axis.line = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank(),
          plot.title = element_text(hjust = 0.5, color = "#666666"))

pie

========identifying differences/association in reclassification of variants between ancestry groups. ================

##prepare count data
chisq_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(Class = case_when(class_merged == "Upgrade VUS to Path"  ~ "Up VUS-Path",
                         class_merged == "Upgrade Benign to VUS"  ~ "Up Benign-VUS",
                         class_merged == "Upgrade Benign to Path"  ~ "U Benign-Path",
                         class_merged == "Downgrade VUS to Benign"  ~ "Down VUS-Benign",
                         class_merged == "Downgrade Path to VUS"  ~ "Down Path-VUS",
                         class_merged == "Downgrade Path to Benign"  ~ "Down Path-Benign",
#                         Classification == "Downgrade Path to VUS"  ~ "Downgrade Path to VUS",
                            TRUE ~ "Conflicting/Complex")
      ) %>% 
group_by(class_merged, ethnic_group_AFmax) %>% 
  summarise(count=n()) %>% 
as.data.frame() %>%
spread(key = class_merged, value = count)
## `summarise()` has grouped output by 'class_merged'. You can override using the
## `.groups` argument.
rownames(chisq_dat) = chisq_dat$ethnic_group_AFmax
chisq_dat$ethnic_group_AFmax = NULL
chisq_dat_mat = as.matrix(chisq_dat)

visualize the contigency table

library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
dt <- as.table(as.matrix(chisq_dat))
balloonplot(t(dt), main ="count of variants with change in classification by ancestry", xlab ="", ylab="",label = FALSE, show.margins = FALSE, zlab = FALSE)

chisq_dat_mat
##        Conflicting/Complex Downgrade Path to Benign Downgrade Path to VUS
## AF_afr                 867                      219                   772
## AF_amr                 499                      164                  1141
## AF_asj                1599                      251                   272
## AF_eas                 984                      137                   531
## AF_fin                 844                      173                   206
## AF_nfe                2637                      384                  2262
## AF_oth                 713                      140                   633
## AF_sas                 983                      205                   882
##        Downgrade VUS to Benign Upgrade Benign to Path Upgrade Benign to VUS
## AF_afr                   39392                    107                  7075
## AF_amr                   20770                     48                  3988
## AF_asj                   16661                     74                  3371
## AF_eas                   20990                    289                  3986
## AF_fin                    7067                    108                  1331
## AF_nfe                   41109                    151                  8143
## AF_oth                   15275                     46                  3077
## AF_sas                   23239                     55                  4184
##        Upgrade VUS to Path
## AF_afr                2399
## AF_amr                2251
## AF_asj                 798
## AF_eas                1649
## AF_fin                 646
## AF_nfe                6399
## AF_oth                1502
## AF_sas                2107

Chi-square test

The chi-square test of independence is used to analyze the frequency table (i.e. contengency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables.

Chi-square test examines whether rows and columns of a contingency table are statistically significantly associated.

Null hypothesis (H0): the row and the column variables of the contingency table are independent.
Alternative hypothesis (H1): row and column variables are dependent

For each cell of the table, we have to calculate the expected value under null hypothesis.

For a given cell, the expected value is calculated as follow:

e=row.sum∗col.sum/grand.total

The Chi-square statistic is calculated as follow:

χ2=∑(o−e)2e

o is the observed value
e is the expected value

This calculated Chi-square statistic is compared to the critical value (obtained from statistical tables) with df=(r−1)(c−1)

degrees of freedom and p = 0.05.

r is the number of rows in the contingency table
c is the number of column in the contingency table

If the calculated Chi-square statistic is greater than the critical value, then we must conclude that the row and the column variables are not independent of each other. This implies that they are significantly associated.

Compute chi-square test in R

Chi-square statistic can be easily computed using the function chisq.test() as follow:

chisq = chisq.test(chisq_dat_mat)
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  chisq_dat_mat
## X-squared = 6561.4, df = 42, p-value < 2.2e-16

the row (ancestry groups) and the column (change in classification) variables are statistically significantly associated (p-value = 0).

The observed and the expected counts can be extracted from the result of the test as follow:

# Observed counts
chisq$observed
##        Conflicting/Complex Downgrade Path to Benign Downgrade Path to VUS
## AF_afr                 867                      219                   772
## AF_amr                 499                      164                  1141
## AF_asj                1599                      251                   272
## AF_eas                 984                      137                   531
## AF_fin                 844                      173                   206
## AF_nfe                2637                      384                  2262
## AF_oth                 713                      140                   633
## AF_sas                 983                      205                   882
##        Downgrade VUS to Benign Upgrade Benign to Path Upgrade Benign to VUS
## AF_afr                   39392                    107                  7075
## AF_amr                   20770                     48                  3988
## AF_asj                   16661                     74                  3371
## AF_eas                   20990                    289                  3986
## AF_fin                    7067                    108                  1331
## AF_nfe                   41109                    151                  8143
## AF_oth                   15275                     46                  3077
## AF_sas                   23239                     55                  4184
##        Upgrade VUS to Path
## AF_afr                2399
## AF_amr                2251
## AF_asj                 798
## AF_eas                1649
## AF_fin                 646
## AF_nfe                6399
## AF_oth                1502
## AF_sas                2107
# Expected counts
round(chisq$expected,2)
##        Conflicting/Complex Downgrade Path to Benign Downgrade Path to VUS
## AF_afr             1813.57                   332.47               1331.26
## AF_amr             1029.71                   188.77                755.87
## AF_asj              821.53                   150.60                603.05
## AF_eas             1019.19                   186.84                748.14
## AF_fin              370.16                    67.86                271.72
## AF_nfe             2179.42                   399.54               1599.81
## AF_oth              763.02                   139.88                560.10
## AF_sas             1129.40                   207.04                829.04
##        Downgrade VUS to Benign Upgrade Benign to Path Upgrade Benign to VUS
## AF_afr                36665.45                 174.48               6986.19
## AF_amr                20818.04                  99.07               3966.65
## AF_asj                16609.13                  79.04               3164.69
## AF_eas                20605.25                  98.05               3926.10
## AF_fin                 7483.70                  35.61               1425.94
## AF_nfe                44061.87                 209.68               8395.50
## AF_oth                15426.16                  73.41               2939.28
## AF_sas                22833.40                 108.66               4350.65
##        Upgrade VUS to Path
## AF_afr             3527.58
## AF_amr             2002.90
## AF_asj             1597.96
## AF_eas             1982.43
## AF_fin              720.01
## AF_nfe             4239.18
## AF_oth             1484.15
## AF_sas             2196.80

Nature of the dependence between the ancestry groups and the change in classification variable

If you want to know the most contributing cells to the total Chi-square score, you just have to calculate the Chi-square statistic for each cell:

r = o − e /√e

The above formula returns the so-called Pearson residuals (r) for each cell (or standardized residuals).

Cells with the highest absolute standardized residuals contribute the most to the total Chi-square score.

visualize Pearson residuals

For a given cell, the size of the circle is proportional to the amount of the cell contribution.

The sign of the standardized residuals is also very important to interpret the association between rows and columns as explained below.

1.    Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables.

2.    In the image below, it’s evident that there are an association between the column "conflicting/complex" and the rows AF_asj, AF_fin; Down Path-VUS and AF_amr, AF_nfe, Down VUS-Benign AF_afr; Up Benign-Path and AF_eas.
There is a strong positive association between the column Up VUS-Path and the row AF-nfe.

3.    Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables. For example the column Up VUS-Path are negatively associated (~ “not associated”) with the row AF-afr and AF_asj. There is a repulsion between the column Down Path-VUS and, the row AF_afr.
corrplot(chisq$residuals, is.corr = FALSE)

The contribution (in %) of a given cell to the total Chi-square score is calculated as follow contrib = r2/χ2

r is the residual of the cell

Contibution in percentage (%)

contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)
##        Conflicting/Complex Downgrade Path to Benign Downgrade Path to VUS
## AF_afr               7.530                    0.590                 3.581
## AF_amr               4.169                    0.050                 2.991
## AF_asj              11.214                    1.020                 2.770
## AF_eas               0.019                    0.203                 0.961
## AF_fin               9.244                    2.483                 0.242
## AF_nfe               1.464                    0.009                 4.177
## AF_oth               0.050                    0.000                 0.145
## AF_sas               0.289                    0.000                 0.052
##        Downgrade VUS to Benign Upgrade Benign to Path Upgrade Benign to VUS
## AF_afr                   3.090                  0.398                 0.017
## AF_amr                   0.002                  0.401                 0.002
## AF_asj                   0.002                  0.005                 0.205
## AF_eas                   0.109                  5.667                 0.014
## AF_fin                   0.354                  2.242                 0.096
## AF_nfe                   3.016                  0.250                 0.116
## AF_oth                   0.023                  0.156                 0.098
## AF_sas                   0.110                  0.404                 0.097
##        Upgrade VUS to Path
## AF_afr               5.503
## AF_amr               0.468
## AF_asj               6.103
## AF_eas               0.855
## AF_fin               0.116
## AF_nfe              16.771
## AF_oth               0.003
## AF_sas               0.056

The relative contribution of each cell to the total Chi-square score give some indication of the nature of the dependency between rows and columns of the contingency table.

It can be seen that:

The column “Up VUS-Path” is strongly associated with AF_nfe.
corrplot(contrib, is.cor = FALSE)

From the image above, it can be seen that the most contributing cells to the Chi-square is AF_nfe/Up VUS-Path

These cells contribute about 21.280 % to the total Chi-square score and thus account for most of the difference between expected and observed values. This confirms the earlier visual interpretation of the data. As stated earlier, visual interpretation may be complex when the contingency table is very large. In this case, the contribution of one cell to the total Chi-square score becomes a useful way of establishing the nature of dependency.

++++++++++++++++++++precdictive modeling+++++++++++++++++++++++++++

perform mean imputation on rows/values with NA

glm_dat = glm_dat %>% mutate(across(where(is.numeric), ~replace_na(., mean(., na.rm=TRUE))))

__________________________ Comparative analysis. __________________________

differences in reclassification “Upgrade VUS to Path” group between ancestry

we recode the classification, where “No change” is removed

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
#  distinct(variation_id, .keep_all = TRUE) %>% 
#  filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
  mutate(class = if_else(class_merged == "Upgrade VUS to Path", "Upgrade VUS to Path", "other class")) %>% 
  group_by(class, ethnic_group_AFmax) %>% 
  summarise(count=n()) %>% ungroup() %>%
  group_by(ethnic_group_AFmax) %>%
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  filter(!is.na(class), class != "other class") %>%
  ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) + 
  geom_bar(stat="identity", color="black", position=position_dodge()) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) + 
  ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.

model 1 - Logistic regression: the initial run: Upgrade VUS to Path vs others

##prepare count data
lr_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Upgrade VUS to Path", "Upgrade VUS to Path", "other class")) %>%
  rename(ancestry_max_AF = ethnic_group_AFmax) %>% 
#  mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
  data = DF,
  family = "binomial"
)

# print results
summary(m2)
## 
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial", 
##     data = DF)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4704  -0.4030  -0.3712  -0.3110   2.5932  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.14547    0.01321 -162.39   <2e-16 ***
## factor(ancestry_max_AF)AF_afr -0.85964    0.02474  -34.75   <2e-16 ***
## factor(ancestry_max_AF)AF_amr -0.32445    0.02562  -12.66   <2e-16 ***
## factor(ancestry_max_AF)AF_asj -1.18153    0.03838  -30.79   <2e-16 ***
## factor(ancestry_max_AF)AF_eas -0.64712    0.02860  -22.62   <2e-16 ***
## factor(ancestry_max_AF)AF_fin -0.56660    0.04272  -13.26   <2e-16 ***
## factor(ancestry_max_AF)AF_oth -0.43765    0.02984  -14.66   <2e-16 ***
## factor(ancestry_max_AF)AF_sas -0.49528    0.02613  -18.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 128956  on 255784  degrees of freedom
## Residual deviance: 126860  on 255777  degrees of freedom
## AIC: 126876
## 
## Number of Fisher Scoring iterations: 6

model 2 - Logistic regression: the initial run: Upgrade Benign to VUS vs others

differences in reclassification “Upgrade Benign to VUS” group between ancestry

we recode the classification, where “No change” is removed

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
#  distinct(variation_id, .keep_all = TRUE) %>% 
#  filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
  mutate(class = if_else(class_merged == "Upgrade Benign to VUS", "Upgrade Benign to VUS", "other class")) %>% 
  group_by(class, ethnic_group_AFmax) %>% 
  summarise(count=n()) %>% ungroup() %>%
  group_by(ethnic_group_AFmax) %>%
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  filter(!is.na(class), class != "other class") %>%
  ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) + 
  geom_bar(stat="identity", color="black", position=position_dodge()) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) + 
  ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.

##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Upgrade Benign to VUS", "Upgrade Benign to VUS", "other class")) %>%
  rename(ancestry_max_AF = ethnic_group_AFmax) %>% 
#  mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
  data = DF,
  family = "binomial"
)

# print results
summary(m2)
## 
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial", 
##     data = DF)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5627  -0.5475  -0.5454  -0.5325   2.0266  
## 
## Coefficients:
##                                Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                   -1.872038   0.011903 -157.268  < 2e-16 ***
## factor(ancestry_max_AF)AF_afr  0.049977   0.017490    2.858 0.004270 ** 
## factor(ancestry_max_AF)AF_amr  0.041545   0.020800    1.997 0.045787 *  
## factor(ancestry_max_AF)AF_asj  0.108916   0.022118    4.924 8.47e-07 ***
## factor(ancestry_max_AF)AF_eas  0.052893   0.020815    2.541 0.011049 *  
## factor(ancestry_max_AF)AF_fin -0.044133   0.031679   -1.393 0.163586    
## factor(ancestry_max_AF)AF_oth  0.088601   0.022832    3.881 0.000104 ***
## factor(ancestry_max_AF)AF_sas -0.009825   0.020423   -0.481 0.630468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 204776  on 255784  degrees of freedom
## Residual deviance: 204726  on 255777  degrees of freedom
## AIC: 204742
## 
## Number of Fisher Scoring iterations: 4

model 3 - Logistic regression: the initial run: Upgrade Benign to Path vs others

differences in reclassification “Upgrade Benign to Path” group between ancestry

we recode the classification, where “No change” is removed

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
#  distinct(variation_id, .keep_all = TRUE) %>% 
#  filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
  mutate(class = if_else(class_merged == "Upgrade Benign to Path", "Upgrade Benign to Path", "other class")) %>% 
  group_by(class, ethnic_group_AFmax) %>% 
  summarise(count=n()) %>% ungroup() %>%
  group_by(ethnic_group_AFmax) %>%
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  filter(!is.na(class), class != "other class") %>%
  ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) + 
  geom_bar(stat="identity", color="black", position=position_dodge()) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) + 
  ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.

##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Upgrade Benign to Path", "Upgrade Benign to Path", "other class")) %>%
  rename(ancestry_max_AF = ethnic_group_AFmax) %>% 
#  mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
  data = DF,
  family = "binomial"
)

# print results
summary(m2)
## 
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial", 
##     data = DF)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.1447  -0.0704  -0.0656  -0.0649   3.5774  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -6.00027    0.08148 -73.641   <2e-16 ***
## factor(ancestry_max_AF)AF_afr -0.16106    0.12651  -1.273   0.2030    
## factor(ancestry_max_AF)AF_amr -0.39711    0.16585  -2.394   0.0166 *  
## factor(ancestry_max_AF)AF_asj  0.26317    0.14211   1.852   0.0640 .  
## factor(ancestry_max_AF)AF_eas  1.41689    0.10067  14.075   <2e-16 ***
## factor(ancestry_max_AF)AF_fin  1.44571    0.12647  11.431   <2e-16 ***
## factor(ancestry_max_AF)AF_oth -0.13943    0.16860  -0.827   0.4082    
## factor(ancestry_max_AF)AF_sas -0.35331    0.15765  -2.241   0.0250 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11717  on 255784  degrees of freedom
## Residual deviance: 11252  on 255777  degrees of freedom
## AIC: 11268
## 
## Number of Fisher Scoring iterations: 9

model 4 - Logistic regression: the initial run: Downgrade VUS to Benign vs others

differences in reclassification “Downgrade VUS to Benign” group between ancestry

we recode the classification, where “No change” is removed

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
#  distinct(variation_id, .keep_all = TRUE) %>% 
#  filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
  mutate(class = if_else(class_merged == "Downgrade VUS to Benign", "Downgrade VUS to Benign", "other class")) %>% 
  group_by(class, ethnic_group_AFmax) %>% 
  summarise(count=n()) %>% ungroup() %>%
  group_by(ethnic_group_AFmax) %>%
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  filter(!is.na(class), class != "other class") %>%
  ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) + 
  geom_bar(stat="identity", color="black", position=position_dodge()) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) + 
  ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.

##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Downgrade VUS to Benign", "Downgrade VUS to Benign", "other class")) %>%
  rename(ancestry_max_AF = ethnic_group_AFmax) %>% 
#  mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
  data = DF,
  family = "binomial"
)

# print results
summary(m2)
## 
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial", 
##     data = DF)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8900  -0.8112  -0.7851   1.4951   1.7271  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -0.721696   0.008625 -83.678   <2e-16 ***
## factor(ancestry_max_AF)AF_afr -0.514839   0.013682 -37.629   <2e-16 ***
## factor(ancestry_max_AF)AF_amr -0.221062   0.015688 -14.091   <2e-16 ***
## factor(ancestry_max_AF)AF_asj -0.240561   0.017074 -14.089   <2e-16 ***
## factor(ancestry_max_AF)AF_eas -0.297365   0.015938 -18.658   <2e-16 ***
## factor(ancestry_max_AF)AF_fin -0.037397   0.022764  -1.643      0.1    
## factor(ancestry_max_AF)AF_oth -0.194432   0.017421 -11.161   <2e-16 ***
## factor(ancestry_max_AF)AF_sas -0.294002   0.015370 -19.128   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 302697  on 255784  degrees of freedom
## Residual deviance: 301110  on 255777  degrees of freedom
## AIC: 301126
## 
## Number of Fisher Scoring iterations: 4

model 5 - Logistic regression: the initial run: Downgrade Path to VUS vs others

differences in reclassification “Downgrade Path to VUS” group between ancestry

we recode the classification, where “No change” is removed

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
#  distinct(variation_id, .keep_all = TRUE) %>% 
#  filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
  mutate(class = if_else(class_merged == "Downgrade Path to VUS", "Downgrade Path to VUS", "other class")) %>% 
  group_by(class, ethnic_group_AFmax) %>% 
  summarise(count=n()) %>% ungroup() %>%
  group_by(ethnic_group_AFmax) %>%
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  filter(!is.na(class), class != "other class") %>%
  ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) + 
  geom_bar(stat="identity", color="black", position=position_dodge()) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) + 
  ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.

##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Downgrade Path to VUS", "Downgrade Path to VUS", "other class")) %>%
  rename(ancestry_max_AF = ethnic_group_AFmax) %>% 
#  mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
  data = DF,
  family = "binomial"
)

# print results
summary(m2)
## 
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial", 
##     data = DF)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9795   0.1749   0.2377   0.2747   0.2840  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    3.25828    0.02143 152.069  < 2e-16 ***
## factor(ancestry_max_AF)AF_afr  0.91369    0.04212  21.691  < 2e-16 ***
## factor(ancestry_max_AF)AF_amr -0.06803    0.03703  -1.837   0.0662 .  
## factor(ancestry_max_AF)AF_asj  1.16841    0.06465  18.073  < 2e-16 ***
## factor(ancestry_max_AF)AF_eas  0.70816    0.04876  14.522  < 2e-16 ***
## factor(ancestry_max_AF)AF_fin  0.64094    0.07356   8.713  < 2e-16 ***
## factor(ancestry_max_AF)AF_oth  0.23169    0.04568   5.072 3.94e-07 ***
## factor(ancestry_max_AF)AF_sas  0.29392    0.04032   7.290 3.09e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 62022  on 255784  degrees of freedom
## Residual deviance: 60977  on 255777  degrees of freedom
## AIC: 60993
## 
## Number of Fisher Scoring iterations: 7

model 6 - Logistic regression: the initial run: Downgrade Path to Benign vs others

differences in reclassification “Downgrade Path to Benign” group between ancestry

we recode the classification, where “No change” is removed

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
#  distinct(variation_id, .keep_all = TRUE) %>% 
#  filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
  mutate(class = if_else(class_merged == "Downgrade Path to Benign", "Downgrade Path to Benign", "other class")) %>% 
  group_by(class, ethnic_group_AFmax) %>% 
  summarise(count=n()) %>% ungroup() %>%
  group_by(ethnic_group_AFmax) %>%
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  filter(!is.na(class), class != "other class") %>%
  ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) + 
  geom_bar(stat="identity", color="black", position=position_dodge()) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) + 
  ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.

##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Downgrade Path to Benign", "Downgrade Path to Benign", "other class")) %>%
  rename(ancestry_max_AF = ethnic_group_AFmax) %>% 
#  mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
  data = DF,
  family = "binomial"
)

# print results
summary(m2)
## 
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial", 
##     data = DF)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3007   0.0981   0.1123   0.1140   0.1834  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    5.06307    0.05119  98.903  < 2e-16 ***
## factor(ancestry_max_AF)AF_afr  0.37980    0.08489   4.474 7.68e-06 ***
## factor(ancestry_max_AF)AF_amr  0.10161    0.09356   1.086  0.27746    
## factor(ancestry_max_AF)AF_asj -0.55511    0.08154  -6.808 9.91e-12 ***
## factor(ancestry_max_AF)AF_eas  0.27211    0.09978   2.727  0.00639 ** 
## factor(ancestry_max_AF)AF_fin -0.98603    0.09219 -10.696  < 2e-16 ***
## factor(ancestry_max_AF)AF_oth -0.04079    0.09905  -0.412  0.68046    
## factor(ancestry_max_AF)AF_sas -0.02993    0.08678  -0.345  0.73018    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20164  on 255784  degrees of freedom
## Residual deviance: 19931  on 255777  degrees of freedom
## AIC: 19947
## 
## Number of Fisher Scoring iterations: 8

model 7 - Logistic regression: the initial run: Conflicting/Complex vs others

differences in reclassification “Conflicting/Complex” group between ancestry

we recode the classification, where “No change” is removed

clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
#  distinct(variation_id, .keep_all = TRUE) %>% 
#  filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
  mutate(class = if_else(class_merged == "Conflicting/Complex", "Conflicting/Complex", "other class")) %>% 
  group_by(class, ethnic_group_AFmax) %>% 
  summarise(count=n()) %>% ungroup() %>%
  group_by(ethnic_group_AFmax) %>%
  mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>% 
  filter(!is.na(class), class != "other class") %>%
  ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) + 
  geom_bar(stat="identity", color="black", position=position_dodge()) + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) + 
  ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.

##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Conflicting/Complex", "Conflicting/Complex", "other class")) %>%
  rename(ancestry_max_AF = ethnic_group_AFmax) %>% 
#  mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
  data = DF,
  family = "binomial"
)

# print results
summary(m2)
## 
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial", 
##     data = DF)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8535   0.1868   0.2604   0.2971   0.4119  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    3.09850    0.01991 155.641  < 2e-16 ***
## factor(ancestry_max_AF)AF_afr  0.95552    0.03961  24.120  < 2e-16 ***
## factor(ancestry_max_AF)AF_amr  0.94170    0.04934  19.084  < 2e-16 ***
## factor(ancestry_max_AF)AF_asj -0.50322    0.03269 -15.396  < 2e-16 ***
## factor(ancestry_max_AF)AF_eas  0.23480    0.03806   6.169 6.89e-10 ***
## factor(ancestry_max_AF)AF_fin -0.67434    0.04106 -16.423  < 2e-16 ***
## factor(ancestry_max_AF)AF_oth  0.26861    0.04298   6.250 4.11e-10 ***
## factor(ancestry_max_AF)AF_sas  0.34200    0.03803   8.993  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 78760  on 255784  degrees of freedom
## Residual deviance: 76593  on 255777  degrees of freedom
## AIC: 76609
## 
## Number of Fisher Scoring iterations: 6

sankey: reclassification pathway of variant’s pathogenicity

library(ggsankey)
plot_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% 
  select(submitted_classification, class_merged)

plot_dat_lg = plot_dat %>% filter(class_merged != "No Change") %>% make_long(submitted_classification, class_merged)

pl <- ggplot(plot_dat_lg, aes(x = x,                        
                     next_x = next_x,                                     
                     node = node,
                     next_node = next_node,        
                     fill = factor(node),
                     label = node))             # This Creates a label for each node
                     
pl <- pl +geom_sankey(flow.alpha = 0.5,          #This Creates the transparency of your node 
                      node.color = "black",     # This is your node color        
                      show.legend = TRUE)        # This determines if you want your legend to show

pl <- pl + geom_sankey_label(Size = 3, 
                             color = "black", 
                             fill = "white") # This specifies the Label format for each node 
## Warning in geom_sankey_label(Size = 3, color = "black", fill = "white"):
## Ignoring unknown parameters: `Size`
pl <- pl + theme_bw()
pl <- pl + theme(legend.position = 'none')
pl <- pl + theme(axis.title = element_blank(),
                 axis.text.y = element_blank(),
                 axis.ticks = element_blank(),
                 panel.grid = element_blank())

pl <- pl + scale_fill_viridis_d(option = "inferno")
pl <- pl + labs(title = "reclassification of variants")
pl <- pl + labs(subtitle = "CLINVAR: 05-20-2023 release")
pl <- pl + labs(caption ="Trust Odia and Theodore Drivas" )
pl <- pl + labs(fill = 'Nodes')
pl